Posts with tag Data exploration and visualization
Back to all postsAs promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses–two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well curated, TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote–interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).
Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning’s Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.
I mentioned earlier I’ve been rebuilding my database; I’ve also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which includes a lot of pattern searching and public outreach, we have some interesting choices about presentation.
Let’s start with two self-evident facts about how print culture changes over time:
- The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
- The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.
One of the most important services a computer can provide for us is a different way of reading. It’s fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.
And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress headings classifications are probably the best hierarchical classification of books we’ll ever get. Certainly they’re the best human-done hierarchical classification. It’s literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they’re also a little foreign, at times, and it’s not clear how well they’ll correspond to machine-centric ways of categorizing books. I’ve been playing around with some of the data on LCC headings classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.
Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I’m interested in for my dissertation by using the Library of Congress classifications for the books. I’m going to start with the difference between psychology and philosophy. I’ve already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.
I’ll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.
Because of my primitive search engine, I’ve been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don’t get:
More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I’m thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I’ve been working with can help improve this sort of search. I’ll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?
I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.
The HathiTrust Bibliographic API is great. What a resource. There are a few odd tricks I had to put in to account for their integrating various catalogs together (Michigan call numbers are filed under MARC 050 (Library of Congress catalog), while California ones are filed under MARC 090 (local catalog), for instance, although they both seem to be basically an LCC scheme). But the openness is fantastic–you just plug in OCLC or LCCN identifiers into a url string to get an xml record. It’s possible to get a lot of OCLCs, in particular, by scraping Internet Archive pages. I haven’t yet found a good way to go the opposite direction, though: from a large number of specially chosen Hathi catalogue items to IA books.
Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I’m going to try again. This post is largely a test of whether I can explain principal components analysis to people who don’t know about it so: correct me if you already understand PCA, and let me know me know what’s unclear if you don’t. (Or, it goes without saying, skip it.)
I’m interested in the ways different words are tied together. That’s sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for “scientific method,” but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I’m going to think through this staying on “capitalist” as the word of the day. Fair warning: this post is a rambler.
Dan asks for some numbers on “capitalism” and “capitalist” similar to the ones on “Darwinism” and “Darwinist” I ran for Hank earlier. That seems like a nice big question I can use to get some basic methods to warm up the new database I set up this week and to get some basic functionality written into it.
This verges on unreflective datadumping: but because it’s easy and I think people might find it interesting, I’m going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers on the same terms for which the Times published Cohen’s charts of title word counts. I’ve tossed in a couple extra words where it seems interesting—including some alternate word-forms that tell a story, using a perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren’t many American books from before, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends–thank you, Sonny Bono.) In some cases, (that 1874 peak for science), the American and British trends are surprisingly close. Sometimes, they aren’t.
So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting–how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the civil war sesquicentennial now, or the Darwin/Lincoln year last year.
I was starting to write about the implicit model of historical change behind loess curves, which I’ll probably post soon, when I started to think some more about a great counterexample to the gradual change I’m looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; min-height: 14.0px}
What are the most distinctive words in the nineteenth century? That’s an impossible question, of course. But as I started to say in my first post about bookcounts, [link] we can find something useful–the words that are most concentrated in specific texts. Some words appear at about the same rate in all books, while some are more highly concentrated in particular books. And historically, the words that are more highly concentrated may be more specific in their meanings–at the very least, they might help us to analyze genre or other forms of contextual distribution.
All right, let’s put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term ‘scientific method.’ I assume he was asking a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays that we can use. This post is a sort of catalog of what some of the low-hanging fruit in text analysis are.
I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between between wordcounts and bookcounts. (I’m just going to call them bookcounts–I hope that’s a clear enough phrase).
Here’s what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster’s); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that’s too many, for reasons too technical to go into here. Suffice it to say that I’m asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though–I’ll put the only big thought I have about it in another post later tonight.